Language generation and personality: two dimensions, two stages, two hemispheres?

Authors

  • Jon Oberlander
  • Alastair Gill
Abstract

We are interested in generating text in a way which helps convey the writer’s personality. This has led us to consider the relationship between language production and personality from a Marrian perspective. We already have data to be covered at the computational level (comparative corpus analysis). We consider that findings at the implementation level (cognitive neuroscience) will help guide architectural explorations at the algorithmic level (computational linguistics). This position statement indicates the data and processing hypotheses which we have arrived at, and suggests that neurocognitive results concerning hemispheric asymmetry may be particularly relevant.

Personality and language production

Personality traits lie at the more temporally-stable and less intense end of the scale of affective states and processes. There are a number of approaches to personality (Matthews and Deary, 1998). Two of the most prominent trait theories are the five factor model (McCrae and Costa, 1987), and Eysenck’s three-factor PEN model (Eysenck and Eysenck 1991, Eysenck et al. 1985). These agree that two main factors are Extraversion (sociability) and Neuroticism (emotional stability). The Five Factor Model sees three further dimensions: Conscientiousness, Agreeableness and Openness; PEN arguably conflates these into one dimension, Psychoticism (tough mindedness). In what follows, we focus on the first two dimensions, common to both models.

In the past, simple approaches to our generation task have involved two steps. The first is to check the literature on individual differences and language production (Pennebaker and King 1999, Berry et al. 1997, Groom and Pennebaker 2003, Campbell and Pennebaker 2003, Furnham 1990, Dewaele and Furnham 1999, Dewaele and Furnham 2000, Dewaele and Pavlenko 2002, Scherer 1979). The second is to pick a number of features associated with a personality trait, and then ensure that they either always or never appear in a language generation system’s output.
For instance, Nass et al. (1995) manipulated dominance (a facet related to Extraversion) by avoiding hedge-expressions such as ‘perhaps’, and ensuring that the system initiated pairs of turns.

Copyright © 2004, American Association for Artificial Intelligence (www.aaai.org). All rights reserved.

However, we are interested in text generation, whereas the great majority of work on language in personality psychology has focussed on spoken language. Where it has considered written text, it has usually confined itself to counting occurrences in a text of words listed in a pre-defined dictionary (e.g. Pennebaker and Francis 1999, Pennebaker and King 1999). (Although Dewaele has gone further, and analysed part-of-speech and lemmatised word frequencies.) Yet there’s clearly more to language generation than lexical choice. So one obvious way of improving systems designed to convey personality is to use more sensitive techniques to detect subtle yet pervasive language-personality patterns.

That is exactly what we have been doing (while also testing how good judges are at perceiving personality from texts sampled from these corpora; cf. Gill and Oberlander 2003b). We have exploited more sensitive data-driven techniques from corpus linguistics, and compared n-grams (of words and punctuation, and additional meta-linguistic information) of various lengths, as well as part-of-speech and semantic analysis and psycholinguistic measures on word use (Rayson 2003, Argamon et al. 2003, Aarts and Granger 1998, Milton 1998, Thomas and Wilson 1996, Rayson et al. 1997, Damerau 1993, Coltheart 1981). We have applied these techniques to corpora collected from subjects whose personality is measured via Eysenck’s EPQ instrument (cf. Dewaele and Furnham 2000, Dewaele and Pavlenko 2002). This has allowed us to gather and analyse a corpus of email messages (amounting to 65,000 words from 105 subjects).
The techniques and tools from computational corpus linguistics have allowed us to uncover more subtle relations between personality and language than has hitherto been possible (Gill and Oberlander 2002, 2003a, 2003b, 2003c; Gill 2003).

Language data to be explained

We have uncovered numerous surface cues to Extraversion, Neuroticism and Psychoticism; to indicate the kind of work we have been carrying out, this section briefly rehearses some of the features which appear to vary by these dimensions. On the one hand, we carried out dictionary-based top-down regression analysis of our corpus of e-mail texts. On the other, we carried out bottom-up comparative analysis of sub-corpora, to isolate patterns of words (or parts of speech) that were distinctive of personality types. We will touch briefly on the first type of analysis, and go into more detail on the second.

Results of top-down analyses

A series of multiple regression analyses was carried out on the corpus, relating personality scores to prevalence of terms in either Pennebaker and Francis’ LIWC dictionary, or in the MRC Psycholinguistic database (Coltheart 1981).

Taking Extraversion and the LIWC dictionary first, comparing higher with lower Extraversion, we found fewer number expressions and more words overall (R = .08, p < 0.05). With the MRC dictionary, we found lower concreteness overall (R = .05, p < 0.05). The former result fits with the general finding that Extraverts speak more, and are generally less precise. The latter finding suggests that they also prefer less specific, more abstract language. This would fit the idea that the need to seize or maintain the conversational floor leads to high Extraverts putting less effort into precise lexical choice. See Gill and Oberlander (2002) for more details.

Turning to Neuroticism, using LIWC again and comparing higher with lower Neuroticism, we found more ‘Inclusive’ words and more first person references (R = .11, p < 0.01).
The use of inclusives like ‘with’, ‘and’ and ‘include’ is arguably consistent with a desire for attachment, and the use of first person with a preoccupation with the self. With the MRC dictionary, we found higher concreteness overall, and higher mean verbal frequency (R = .14, p < 0.001). This suggests fairly down-to-earth lexical choices, and language that is more speech-like or immediate, overall. The latter feature is consistent with another of our findings, to the effect that higher Neuroticism is associated with lower lexical density (and hence, repetitiveness). See Gill and Oberlander (2003b) for more details.

Results of bottom-up analyses

The original e-mail corpus of texts was divided into stratified sub-corpora. High and Low personality group samples were created by splitting them at greater than 1 standard deviation above and below the EPQ-R score for each dimension. The additional requirement was made that authors had to be within 1 standard deviation on the dimensions other than the one for which they were extremely high or low. Additionally, all texts which were within 1 standard deviation across all personality dimensions were assigned to the personality ‘neutral’ Mid sub-corpus. Thus, on any dimension, we have three groups to compare (High, Mid, and Low). The primary goal is to identify words (unigrams) or strings of words (n-grams) which form reliable collocations for one group, but not for another; these can then be considered distinctive collocations.

Here we present the results from the three-way lemmatised analysis for Extraversion and Neuroticism, in Tables 1 and 2. By lemmatising (or stemming), minor variants of words can be collapsed together, increasing the power of the analysis. In such a processed corpus, words such as ‘play’, ‘plays’, ‘played’, or ‘playing’ are all realised in the base form of the verb: ‘play’.
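The High, Mid and Low grouping described in this section can be sketched in a few lines. This is a minimal illustration, not the authors' code: it assumes the split is taken around the sample mean of each dimension, that the dimensions are E, N and P, and that authors who are extreme on more than one dimension are left out. The names `band`, `assign_subcorpus` and the data layout are hypothetical.

```python
from statistics import mean, stdev

def band(score, mu, sd):
    """Classify one score relative to one standard deviation around mu."""
    if score > mu + sd:
        return "high"
    if score < mu - sd:
        return "low"
    return "mid"

def assign_subcorpus(authors, dims=("E", "N", "P")):
    """Map each author id to (dimension, 'High'/'Low'), to 'Mid', or to None.

    authors: dict of author id -> dict of dimension -> score (assumed layout).
    Assumption: the cut-off is the sample mean plus or minus one sample
    standard deviation, computed over all authors.
    """
    stats = {
        d: (mean(a[d] for a in authors.values()),
            stdev(a[d] for a in authors.values()))
        for d in dims
    }
    labels = {}
    for author_id, scores in authors.items():
        bands = {d: band(scores[d], *stats[d]) for d in dims}
        extreme = [d for d in dims if bands[d] != "mid"]
        if not extreme:
            labels[author_id] = "Mid"            # within 1 SD on every dimension
        elif len(extreme) == 1:
            d = extreme[0]
            labels[author_id] = (d, bands[d].capitalize())
        else:
            labels[author_id] = None             # extreme on several dimensions
    return labels
```

Keeping authors who are extreme on exactly one dimension means that a difference between, say, the High-E and Low-E sub-corpora is less likely to be driven by another trait.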
More importantly, our data contain proper nouns, for example names of places (Edinburgh), days of the week (Saturday), or names of people (Dave); these are too specific to allow broader patterns of language usage to emerge, or for the results to be easily generalised. The corpora were pre-processed using the CLAWS tagger (Rayson 2003) to give vertical-output lemmatised words and part-of-speech (POS) tags. Additional scripts were then used to convert this into the form of lemmas; where a feature was a proper noun, it was replaced by its POS tag.

To identify robust collocations in the sub-corpora, and then to identify those which distinguish one group from another, we start by specifying that a feature should exhibit a frequency in one of the three groups of at least 5 occurrences, and ordering the features by log-likelihood (G²) value. Because we only examine expected frequencies of 5 or more, which compare more reliably with the χ² distribution, we can here present results whose G² reaches the critical value of 10.83, taking this to be equivalent to reaching p ≤ 0.001 significance; results whose G² reaches 15.13 are taken to be equivalent to reaching p ≤ 0.0001 significance (cf. Rayson 2003 on adjustments which have to be made if frequencies of less than 5 are to be considered). Note that if a feature is overused by the Mid group, we do not report the G² for this, and in cases where the relative-frequency ratio or G² is not available, we replace this by ‘-’.

Tables 1 and 2 contain a lot of low-level data. Note that a feature (such as the collocation [will be]) may be underused by one sub-group, compared to the two other groups, or over-used by one group compared to the others.
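The log-likelihood ordering and the two critical values can be made concrete with a short sketch. The text does not spell out the formula, so the block below assumes the two-corpus form of the statistic that is common in corpus comparison (each observed frequency is compared with the frequency expected from the pooled counts). The function names and the corpus sizes in the usage lines are hypothetical; only the thresholds 10.83 and 15.13 come from the text.

```python
import math

# Critical values quoted in the text (df = 1), strictest first.
CRITICAL = [(15.13, "p <= 0.0001"), (10.83, "p <= 0.001")]

def log_likelihood(freq_a, size_a, freq_b, size_b):
    """Two-corpus log-likelihood for one feature (assumed form of the statistic).

    freq_a, freq_b: occurrences of the feature in sub-corpus A and B.
    size_a, size_b: total token counts of sub-corpus A and B.
    """
    total = freq_a + freq_b
    if total == 0:
        return 0.0
    expected_a = size_a * total / (size_a + size_b)
    expected_b = size_b * total / (size_a + size_b)
    g2 = 0.0
    for observed, expected in ((freq_a, expected_a), (freq_b, expected_b)):
        if observed > 0:                 # a zero count contributes nothing
            g2 += observed * math.log(observed / expected)
    return 2.0 * g2

def significance(g2):
    """Return the strictest threshold from the text that g2 reaches, else None."""
    for threshold, label in CRITICAL:
        if g2 >= threshold:
            return label
    return None

# Usage with made-up sub-corpus sizes (the real sizes are not given in the text).
g2 = log_likelihood(freq_a=14, size_a=7000, freq_b=0, size_b=12000)
print(round(g2, 2), significance(g2))
```

Because a zero count contributes nothing to the sum, a feature that one group never uses can still receive a high score if the other group uses it often enough.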
To help characterise the linguistic habits of a group at one or other end of a personality dimension, we can consider both which n-grams they over-use, and also which n-grams are underused by the group at the other end of the dimension. Figure 1 presents just such a digest, for Extraversion and Neuroticism.

Putting the content of Figure 1 into other words, we can say that there are a number of reliable collocations which appear to be distinctive of the personality groups under discussion. Punctuation is surprisingly differentiated. Multiple punctuation (exclamation in particular, but also the multiple dots of ellipsis) is particularly associated with High-N, and also with High-E. Single hyphens are associated with Low-E; commas with Low-N. Several collocations involving the first person singular are apparent for High-E, and a couple for Low-E ([i play], [that i]); for High-N, we find [well i] and [i ca], where the latter lemmatised bigram represents the initial subpart of I can’t or I couldn’t. There are none for Low-N. Interestingly, both High- and Low-N use first person plural less than the Mid reference group. Expressions concerning ability or modality appear in different patterns for the groups. High-E have [i will] and [will have]; Low-E have [be supposed to be]; High-N have [i ca] and [have to]; Low-N have [will be], [have be], and [have not]. NPs appear in distinctive collocations for some groups.

Columns, in order (cells with no value are omitted from a row): Feature; Rank; High Freq.; High R.Freq.; Mid Freq.; Mid R.Freq.; Low Freq.; Low R.Freq.; High-Mid R.F. Ratio; Low-Mid R.F. Ratio; High-Low R.F. Ratio; High-Mid G²; Low-Mid G²; High-Low G²; High Use; Mid Use; Low Use.

[play] 1 3 0.0004 2 0.0002 14 0.0052 2.42 30.31 0.08 0.97 35.63**** 22.53**** +
[get a] 2 15 0.0021 0 0 5 0.0019 1.12 28.86**** 16.74**** 0.05 −
[be so] 3 14 0.0020 0 0 0 0 26.93**** 8.88** +
[i play] 4 0 0 0 0 7 0.0026 23.43**** 18.24**** +
[christmas 〈p〉] 5 10 0.0014 0 0 3 0.0011 1.24 19.24**** 10.04** 0.11 −
[year 〈p〉] 5 10 0.0014 0 0 0 0 19.24**** 6.34* +
[week 〈p〉] 6 0 0 42 0.0036 7 0.0026 0.72 0.69 18.24**** −
[i will] 7 28 0.0039 35 0.0030 0 0 1.29 1.02 17.76**** −
[〈p〉 take] 8 9 0.0013 0 0 0 0 17.31**** 5.71* +
[with i] 8 9 0.0013 0 0 0 0 17.31**** 5.71* +
[be supposed] 9 0 0 0 0 5 0.0019 16.74**** 13.03*** +
[that be] 9 0 0 0 0 5 0.0019 16.74**** 13.03*** +
[bread] 10 0 0 3 0.0003 6 0.0022 8.66 9.87** 15.63**** +
[NP1 for] 11 8 0.0011 0 0 0 0 15.39**** 5.07* +
[i really] 11 8 0.0011 0 0 0 0 15.39**** 5.07* +
[then i] 12 7 0.0010 0 0 3 0.0011 0.87 13.47*** 10.04** 0.04 −
[day 〈p〉] 12 7 0.0010 0 0 0 0 13.47*** 4.44* +
[will have] 12 7 0.0010 0 0 0 0 13.47*** 4.44* +
[NP1 and] 13 21 0.0029 22 0.0019 0 0 1.54 2.00 13.32*** −
[be supposed to] 14 0 0 1 0.0001 5 0.0019 21.66 11.75*** 13.03*** +
[be supposed to be] 14 0 0 1 0.0001 5 0.0019 21.67 11.75*** 13.03*** +
[supposed to be] 14 0 0 1 0.0001 5 0.0019 21.66 11.75*** 13.03*** +
[supposed] 14 0 0 1 0.0001 5 0.0019 21.65 11.74*** 13.03*** +
[supposed to] 14 0 0 1 0.0001 5 0.0019 21.66 11.74*** 13.03*** +
[fairly] 14 0 0 4 0.0003 5 0.0019 5.41 6.03* 13.03*** +
[〈p〉 although] 14 0 0 7 0.0006 5 0.0019 3.09 3.34 13.03*** −
[that i] 14 0 0 27 0.0023 5 0.0019 0.80 0.22 13.03*** −
[and i] 15 20 0.0028 44 0.0038 0 0 0.73 1.35 12.69*** −
[and NP1] 16 19 0.0027 13 0.0011 0 0 2.36 5.84* 12.05*** +
[take] 17 25 0.0035 13 0.0011 7 0.0026 3.11 2.33 1.33 11.79*** 2.93 0.48 −
[cool 〈p〉] 18 6 0.0008 0 0 0 0 11.54*** 3.81 +
[from the] 18 6 0.0008 0 0 0 0 11.54*** 3.81 +
[of it] 18 6 0.0008 0 0 0 0 11.54*** 3.81 +
[today 〈p〉] 18 6 0.0008 0 0 0 0 11.54*** 3.81 +
[what i] 18 6 0.0008 0 0 0 0 11.54*** 3.81 +

Table 1: Lemmatised n-gram analysis, Extraversion. Note.
* = p < .05, ** = p < .01, *** = p < .001, **** = p < .0001, df = 1. In Use columns, + indicates over-use, − indicates under-use.

Columns, in order (cells with no value are omitted from a row): Feature; Rank; High Freq.; High R.Freq.; Mid Freq.; Mid R.Freq.; Low Freq.; Low R.Freq.; High-Mid R.F. Ratio; Low-Mid R.F. Ratio; High-Low R.F. Ratio; High-Mid G²; Low-Mid G²; High-Low G²; High Use; Mid Use; Low Use.

[〈p〉 it] 1 0 0 45 0.0039 38 0.0053 1.37 2.00 34.37**** −
[〈p〉 〈p〉 〈p〉 〈p〉 〈p〉] 2 16 0.0039 2 0.0002 0 0 22.67 31.66**** 32.37**** +
[NP1 〈p〉] 3 16 0.0039 57 0.0049 0 0 0.80 0.68 32.36**** −
[〈p〉 〈p〉 〈p〉 〈p〉] 4 21 0.0051 10 0.0009 2 0.0003 5.95 0.32 18.37 23.50**** 2.65 30.70**** +
[〈p〉 as] 5 0 0 0 0 14 0.0020 26.97**** 12.66*** +
[〈p〉 〈p〉 〈p〉] 6 43 0.0105 56 0.0048 23 0.0032 2.18 0.67 3.27 13.89*** 2.85 22.43**** +
[film] 7 11 0.0027 2 0.0002 0 0 15.58 19.60**** 22.25**** +
[i go] 8 8 0.0020 0 0 8 0.0011 1.75 21.50**** 15.41**** 1.23 −
[that be] 9 7 0.0017 0 0 11 0.0015 1.11 18.81**** 21.19**** 0.05 −
[〈p〉 he] 9 0 0 0 0 11 0.0015 21.19**** 9.95** +
[〈p〉 well] 10 21 0.0051 12 0.0010 10 0.0014 4.96 1.35 3.67 20.43**** 0.48 12.53*** +
[will be] 11 0 0 39 0.0034 21 0.0029 0.87 0.26 18.99**** −
[have be] 11 0 0 37 0.0032 21 0.0029 0.92 0.10 18.99**** −
[all the] 12 9 0.0022 9 0.0008 0 0 2.83 4.67* 18.20**** +
[〈p〉 so] 13 0 0 34 0.0029 20 0.0028 0.95 0.03 18.09**** −
[be in] 14 0 0 0 0 9 0.0013 17.34**** 8.14** +
[〈p〉 〈p〉 well] 15 12 0.0029 4 0.0003 6 0.0008 8.50 2.43 3.50 16.67**** 1.94 6.78** +
[film be] 16 6 0.0015 0 0 0 0 16.12**** 12.13*** +
[the film] 16 6 0.0015 0 0 0 0 16.12**** 12.13*** +
[well i] 16 6 0.0015 0 0 0 0 16.12**** 12.13*** +
[year 〈p〉] 17 0 0 0 0 8 0.0011 15.41**** 7.24** +
[to do] 18 0 0 11 0.0009 17 0.0024 2.50 5.80* 15.37**** +
[to the] 19 0 0 24 0.0021 16 0.0022 1.08 0.06 14.47*** −
[though 〈p〉] 20 7 0.0017 12 0.0010 0 0 1.65 1.06 14.16*** −
[to NP1] 20 7 0.0017 25 0.0022 0 0 0.79 0.31 14.16*** −
[we] 21 18 0.0044 119 0.0103 47 0.0066 0.43 0.64 0.67 13.74*** 7.13** 2.21 +
[still] 22 0 0 30 0.0026 15 0.0021 0.81 0.45 13.57*** −
[about it] 23 0 0 0 0 7 0.0010 13.48*** 6.33* +
[it do] 23 0 0 0 0 7 0.0010 13.48*** 6.33* +
[rowing] 23 0 0 0 0 7 0.0010 13.48*** 6.33* +
[and she] 24 5 0.0012 0 0 0 0 13.44*** 10.11** +
[the film be] 24 5 0.0012 0 0 0 0 13.44*** 10.11** +
[the time] 24 5 0.0012 0 0 0 0 13.44*** 10.11** +
[experiment] 24 5 0.0012 0 0 2 0.0003 4.37 13.44*** 3.85* 3.54 −
[〈p〉 which] 25 0 0 15 0.0013 14 0.0020 1.51 1.22 12.66*** −
[have not] 25 0 0 32 0.0028 14 0.0020 0.71 1.20 12.66*** −
[NP1 and] 25 0 0 22 0.0019 14 0.0020 1.03 0.01 12.66*** −
[stuff] 26 3 0.0007 3 0.0003 13 0.0018 2.83 7.02 0.40 1.56 12.48*** 2.38 +
[〈p〉 〈p〉 we] 27 4 0.0010 34 0.0029 5 0.0007 0.33 0.24 1.40 5.73* 12.45*** 0.25 +
[i ca] 28 6 0.0015 10 0.0009 0 0 1.70 1.00 12.13*** −
[of time] 29 3 0.0007 0 0 6 0.0008 0.87 8.06** 11.56*** 0.04 −
[get a] 29 0 0 0 0 6 0.0008 11.56*** 5.43* +
[go on] 29 0 0 0 0 6 0.0008 11.56*** 5.43* +
[party 〈p〉] 29 0 0 0 0 6 0.0008 11.56*** 5.43* +
[stuff 〈p〉] 29 0 0 0 0 6 0.0008 11.56*** 5.43* +
[have to] 30 21 0.0051 30 0.0026 11 0.0015 1.98 0.59 3.34 5.47* 2.35 11.24*** +
[thesis] 31 6 0.0015 1 0.0001 3 0.0004 16.99 4.86 3.50 10.99*** 2.24 3.39 +
[〈p〉 the] 32 0 0 16 0.0014 12 0.0017 1.21 0.26 10.85*** −
[he be] 32 0 0 22 0.0019 12 0.0017 0.88 0.12 10.85*** −
[well 〈p〉] 32 0 0 18 0.0016 12 0.0017 1.08 0.04 10.85*** −

Table 2: Lemmatised n-gram analysis, Neuroticism. Note. * = p < .05, ** = p < .01, *** = p < .001, **** = p < .0001, df = 1. In Use columns, + indicates over-use, − indicates under-use.
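The pre-processing and counting steps that feed Tables 1 and 2 can be illustrated with a small sketch. It is not the authors' pipeline: the input is assumed to be a list of (lemma, POS tag) pairs per text, proper-noun tags are assumed to begin with "NP" (as in the NP1 entries in the tables), lemmas are assumed to be lower-cased, and the maximum n-gram length of 5 is taken from the longest feature listed. All function names are hypothetical.

```python
from collections import Counter

def normalise(tagged_tokens):
    """Keep the lemma, except for proper nouns, which become their POS tag.

    tagged_tokens: list of (lemma, pos) pairs (assumed input format).
    """
    return [pos if pos.startswith("NP") else lemma.lower()
            for lemma, pos in tagged_tokens]

def ngram_counts(tokens, max_n=5):
    """Count every n-gram of length 1..max_n, keyed by its space-joined form."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts

def candidate_features(group_counts, min_freq=5):
    """Features that occur at least min_freq times in at least one group.

    group_counts: dict of group name -> Counter of n-gram frequencies.
    """
    features = set()
    for counts in group_counts.values():
        features.update(f for f, c in counts.items() if c >= min_freq)
    return features

# Usage on a toy sentence; the tags shown are illustrative.
tokens = normalise([("I", "PPIS1"), ("play", "VV0"), ("in", "II"), ("Edinburgh", "NP1")])
print(ngram_counts(tokens, max_n=2).most_common(3))
```

Features that pass the frequency filter would then be scored pairwise between the High, Mid and Low groups and ranked by log-likelihood.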




Publication date: 2004